Automatic Discovery of Non-Compositional Compounds in Parallel Data
Author
Abstract
Automatic segmentation of text into minimal content-bearing units is an unsolved problem even for languages like English. Spaces between words offer an easy first approximation, but this approximation is not good enough for machine translation (MT), where many word sequences are not translated word-for-word. This paper presents an efficient automatic method for discovering sequences of words that are translated as a unit. The method proceeds by comparing pairs of statistical translation models induced from parallel texts in two languages. It can discover hundreds of noncompositional compounds on each iteration, and constructs longer compounds out of shorter ones. Objective evaluation on a simple machine translation task has shown the method’s potential to improve the quality of MT output. The method makes few assumptions about the data, so it can be applied to parallel data other than parallel texts, such as word spellings and pronunciations.
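The idea of comparing two models, one that treats a candidate word sequence as a single unit and one that does not, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the abstract does not specify the translation models or the comparison criterion, so this sketch substitutes a crude sentence-level co-occurrence distribution for a translation model and uses mutual information as the score. The function names (`fuse`, `cooccurrence_mi`, `fusion_gain`) and the toy bitext are hypothetical.

```python
from collections import Counter
from math import log2


def fuse(tokens, pair):
    """Merge each adjacent occurrence of `pair` into one token 'a_b'."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            out.append(tokens[i] + "_" + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


def cooccurrence_mi(bitext):
    """Mutual information (in bits) of source/target token co-occurrence.

    ASSUMPTION: co-occurrence within aligned sentence pairs stands in for
    a real statistical translation model.
    """
    joint = Counter()
    for src, tgt in bitext:
        for s in src:
            for t in tgt:
                joint[(s, t)] += 1
    total = sum(joint.values())
    src_marg, tgt_marg = Counter(), Counter()
    for (s, t), c in joint.items():
        src_marg[s] += c
        tgt_marg[t] += c
    return sum(
        (c / total) * log2(c * total / (src_marg[s] * tgt_marg[t]))
        for (s, t), c in joint.items()
    )


def fusion_gain(bitext, pair):
    """Score change when `pair` is treated as one unit on the source side."""
    fused = [(fuse(src, pair), tgt) for src, tgt in bitext]
    return cooccurrence_mi(fused) - cooccurrence_mi(bitext)


# Invented toy data: "hot dog" is translated as a unit, not word-for-word.
toy_bitext = [
    (["hot", "dog"], ["salchicha"]),
    (["hot"], ["caliente"]),
    (["dog"], ["perro"]),
]
gain = fusion_gain(toy_bitext, ("hot", "dog"))
print(f"MI gain from fusing ('hot', 'dog'): {gain:.3f} bits")
```

In this sketch a positive gain suggests that the fused token predicts its translation better than the separate words do, which is the intuition behind treating the sequence as a non-compositional compound. A real system would have to score many candidates at once and re-estimate its models on each iteration.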
Similar resources
arXiv:cmp-lg/9706027v1 24 Jun 1997 Automatic Discovery of Non-Compositional Compounds in Parallel Data
Automatic segmentation of text into minimal content-bearing units is an unsolved problem even for languages like English. Spaces between words offer an easy first approximation, but this approximation is not good enough for machine translation (MT), where many word sequences are not translated word-for-word. This paper presents an efficient automatic method for discovering sequences of words th...
Automatic Discovery of Technology Networks for Industrial-Scale R&D IT Projects via Data Mining
Industrial-Scale R&D IT Projects depend on many sub-technologies which need to be understood and have their risks analysed before the project can begin for their success. When planning such an industrial-scale project, the list of technologies and the associations of these technologies with each other is often complex and form a network. Discovery of this network of technologies is time consumi...
Petri Net Modeling for Parallel Bank ATM Systems
In this paper the real-time operation of an automatic teller machine (ATM) is analyzed using a Timed Petri Net (TPN) model. In the modeling, the probability of arrivals and the speed and attentiveness of customers (clients) are taken into account. Different parameters are based on the statistical data. The model is simulated for 24 hours. The diagrams of number of succeeded customers, failed references ...
On the Compositionality and Semantic Interpretation of English Noun Compounds
In this paper we present a study covering the creation of compositional distributional representations for English noun compounds (e.g. computer science) using two compositional models proposed in the literature. The compositional representations are first evaluated based on their similarity to the corresponding corpus-learned representations and then on the task of automatic classification of ...
Automatic implementation of a new recovery coefficient for reliable contour milling
In contour milling, to render the machining process more automated with significant productivity without remaining material after machining, a new recovery coefficient was developed. The coefficient was inserted in the computation of contour parallel tool paths to fix the radial depth of cut in the way to ensure an optimized overlap area between the passes in the corners, without residuals. Thu...
Journal title:
- CoRR
Volume cmp-lg/9706027, Issue
Pages -
Publication date 1997